Abstract

Statistical consistency in phylogenetics has traditionally referred to the ac- 
curacy of estimating phylogenetic parameters for a fixed number of species 
as we increase the number of characters. However, since sequences are often
of fixed length (e.g. for a gene) whereas we can often sample more taxa,
it is useful to consider a dual type of statistical consistency in which we
increase the number of species, rather than characters. This raises some basic
questions: what can we learn about the evolutionary process as we increase 
the number of species? In particular, does having more species allow us to 
infer the ancestral state of characters accurately? This question is partic- 
ularly relevant when sequence site evolution varies in a complex way from 
character to character, as well as for reconstructing ancestral sequences. In 
this paper, we assemble a collection of results to analyse various approaches 
for inferring ancestral information with increasing accuracy as the number 
of taxa increases.

1. Introduction



As Elliott Sober discussed two decades ago [15], there is a fundamental
asymmetry between reconstructing a past state from a present observation
and predicting its future state. Moreover, this holds even when the state
evolves according to a time-reversible process (a process which, when
in equilibrium, behaves the same whether run forward or backward in
time). For instance, consider any continuous-time Markov process on two states,
with arbitrary transition rates (generally unequal) between the two states.
If we observe the state of the process at the present time t, then the 'best'
estimate of the initial state at time 0 is always the present state, but the
'best' estimate of its state at some future time t' > t depends on the actual
transition rates (which may be unknown) [15].
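The asymmetry can be checked numerically. The following is a minimal sketch, assuming a two-state chain with hypothetical rates `a` (for 0 to 1) and `b` (for 1 to 0), the standard closed form $p_{ij}(t) = \pi_j + (\delta_{ij} - \pi_j)e^{-(a+b)t}$, and a reading of 'best' as the maximum likelihood estimate; the function names are illustrative only.

```python
import math

def two_state_p(i, j, t, a, b):
    """P(X_t = j | X_0 = i) for a two-state chain with rate a for 0->1 and b for 1->0."""
    pi = (b / (a + b), a / (a + b))          # equilibrium frequencies of states 0 and 1
    decay = math.exp(-(a + b) * t)
    return pi[j] + ((1.0 if i == j else 0.0) - pi[j]) * decay

def best_initial_state(observed, t, a, b):
    """Maximum likelihood estimate of X_0 given X_t = observed (an assumed reading of 'best')."""
    return max((0, 1), key=lambda i: two_state_p(i, observed, t, a, b))

def best_future_state(current, dt, a, b):
    """Most probable state a further time dt ahead, given the current state."""
    return max((0, 1), key=lambda j: two_state_p(current, j, dt, a, b))
```

Under these assumptions `best_initial_state` returns the observed state whatever the rates, because $p_{ii}(t) - p_{ji}(t) = e^{-(a+b)t} > 0$, whereas `best_future_state` changes its answer when the two rates are swapped.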

When we move beyond two states in a Markov process, the current state 
is no longer guaranteed to always be the 'best' estimate of the ancestral state, 
even for reversible processes, as we describe below. Ancestral state estimation 
assumes a further dimension when we move from the linear evolution of a 
state through time to the bifurcating evolution of states in a tree that results 
in their observed values at the leaves. The presence of many leaves helps us to 
estimate the ancestral state more accurately, but these leaves do not provide 
independent information about the root state due to correlations arising from 
the partial overlap of the paths in the tree as one moves from the root to the 
leaves. The mathematical, statistical and computational aspects of ancestral 
state estimation on a tree have been explored by a number of authors (e.g.
[12, 13, 14]), and the inference of ancestral states is an important
question in biology.

Our interest here is in site-specific models. These are especially relevant 
with proteins, where each site has specific biochemical constraints (e.g. small 
and hydrophobic, aromatic, helix-former, etc). As we are interested in site- 
specific models, the details of the substitution model are mostly unknown. 
For example, the relative or absolute branch length may not be known ex- 
actly, though we may have some upper bound on them. Also, the equilibrium 
frequencies at the site may not be known. This is the case in the CAT model 
for proteins (Q; see also 0]). This model is a mixture of F81-like models, 
where each site follows a Poisson model with specific character frequencies 
defined by the biochemical constraints acting on that site. However, we shall
see that dealing with unknown equilibrium frequencies imposes strong limitations
when the aim is to estimate ancestral characters, especially when the
branch lengths are unknown. Thus, we will also consider special cases where
the equilibrium frequencies are known, or are even all identical.

In most cases (e.g. when the branch lengths are unknown), we are thus 
unable to use standard likelihood calculations based on the pruning algo- 
rithm to compute the most likely character at the tree root. Thus, we will 
discuss and study simple decision rules to predict the state at the tree root. 
Parsimony is an example of such a rule, since it makes no use of the branch
lengths. Another example is the majority rule, which estimates the root state
by selecting the state that is most frequent at the tree leaves. For models
in which the equilibrium frequencies are not uniform across states, more
complex inference rules are required. We shall see that under suitable
assumptions on the tree topology and branch lengths and/or on the model,
these simple rules are statistically consistent as the taxon sampling density
becomes sufficiently large.

We treat four general cases, each depending on the properties of the 
model. We start with the simplest (symmetric) model, then consider two 
overlapping generalizations ('monotone' and 'conservative') and finally we 
deal with the general model, for which stronger assumptions on the tree are 
required. 

1.1. Preliminaries 

Consider a rooted phylogenetic tree T (possibly non-binary) with n leaves
and a set $\mathcal{S}$ of possible states that each vertex can be in. For a single-site
assignment of states at the leaves of T, assume that the assignment
has evolved under a GTR (general time-reversible) model from a particular
character state $s_0$ at the root, with a normalized rate matrix $Q = \Pi S$ (where
$\Pi = \mathrm{diag}(\pi)$ contains the equilibrium frequencies, and S is a symmetric
matrix of 'exchangeabilities'). The process acts on each edge e according to
some associated branch length $l_e$. A more general version of this question
arises when the single site is replaced by a (possibly short) sequence of length
k. Assuming independent site evolution, the problem of ancestral state
estimation remains the same (i.e. each site is solved independently).

We assume that T and S (and perhaps $\pi$) are given and that, in addition,
we either know the $l_e$ values or have some bounds on them (e.g. the sum of the
lengths from the root to any tip is, at most, some given value t). We would
like to use this input to estimate the ancestral state $s_0 \in \mathcal{S}$ at the root of the
tree. The ability to estimate $s_0$ accurately depends on a tradeoff between
what we know about the underlying parameters (e.g. the site rate parameter
$\mu$, the branch lengths $l_e$, and the properties of Q such as the equilibrium
distribution $\pi$) and how 'well behaved' the underlying Markov process is.

In particular, we seek a method M of estimation that is statistically con- 
sistent in the following sense: As n becomes large, and given increasingly 
tight constraints on the tree, its branch lengths or the model, we want the 
accuracy of M (the probability that M reconstructs the ancestral state cor- 
rectly) to converge to 1. 

A natural choice of such a method, when Q is completely specified (including
the equilibrium distribution $\pi$) and the branch lengths ($l_e$) are also
known exactly, is to take the maximum posterior probability (MPP) ancestral
state, that is, the state with the largest posterior probability (the
MPP method can be shown to have the largest expected reconstruction
probability amongst all methods). Note that for a symmetric model the
MPP estimate of the root state is the same as the maximum likelihood (ML)
estimate, but in general the two approaches differ, because the MPP approach
multiplies the likelihood of each root state by the prior probability of
that state.
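The difference between the ML and MPP estimates can be seen in a small example. The sketch below assumes a star tree with known leaf distances, an equal-input (F81-type) model with $p_{ij}(t) = \pi_j + (\delta_{ij} - \pi_j)e^{-t}$ (time scaled so that the overall rate parameter is 1, an assumption made here for simplicity), and the equilibrium distribution as the prior on the root state. The function names are hypothetical.

```python
import math

def p_trans(i, j, t, pi):
    """Equal-input transition probability with equilibrium frequencies pi (rate scaled to 1)."""
    decay = math.exp(-t)
    return pi[j] + ((1.0 if i == j else 0.0) - pi[j]) * decay

def root_likelihoods(leaf_states, leaf_lengths, pi):
    """Likelihood of each candidate root state on a star tree (leaves independent given the root)."""
    return [math.prod(p_trans(s, x, t, pi) for x, t in zip(leaf_states, leaf_lengths))
            for s in range(len(pi))]

def ml_and_mpp(leaf_states, leaf_lengths, pi):
    """Return (ML root state, MPP root state), using pi as the prior at the root."""
    lik = root_likelihoods(leaf_states, leaf_lengths, pi)
    post = [p * l for p, l in zip(pi, lik)]       # unnormalised posterior
    argmax = lambda v: max(range(len(v)), key=v.__getitem__)
    return argmax(lik), argmax(post)
```

With a single leaf observed in state 1 at distance 2 and $\pi = (0.75, 0.25)$, the ML estimate is state 1 but the MPP estimate is state 0; with $\pi = (0.5, 0.5)$ the two estimates always agree.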

When the model is wholly or partly unknown, the ML and MPP approaches
may no longer be feasible. But in these cases, simpler approaches exist. For
example, for a simple symmetric model (e.g. Jukes-Cantor) and a star tree
with unknown branch lengths that are bounded above ($l_e \le l < \infty$), we can
estimate the ancestral state accurately by selecting the majority state (the
consistency of this approach is justified by large deviation theorems for sums
of independent random variables).
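A quick simulation illustrates this behaviour. The sketch below assumes an r-state symmetric model with $p_{ii}(t) = \frac{1}{r} + (1 - \frac{1}{r})e^{-rt/(r-1)}$ (t in expected substitutions), a star tree whose branch lengths are drawn uniformly from $[0, l]$, and ties broken at random; all names and parameter values are illustrative choices, not taken from the paper.

```python
import math
import random
from collections import Counter

def sample_leaf(root, t, r, rng):
    """Sample a leaf state after time t under the r-state symmetric (Jukes-Cantor type) model."""
    stay = 1.0 / r + (1.0 - 1.0 / r) * math.exp(-r * t / (r - 1))
    if rng.random() < stay:
        return root
    return rng.choice([s for s in range(r) if s != root])

def majority_state(states, rng):
    """Most frequent state among the leaves, breaking ties uniformly at random."""
    counts = Counter(states)
    top = max(counts.values())
    return rng.choice(sorted(s for s, c in counts.items() if c == top))

def majority_accuracy(n_leaves, max_len, r=4, reps=200, seed=1):
    """Fraction of replicates in which the majority state equals the root state (state 0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        leaves = [sample_leaf(0, rng.uniform(0.0, max_len), r, rng) for _ in range(n_leaves)]
        hits += majority_state(leaves, rng) == 0
    return hits / reps
```

Calling `majority_accuracy(n, 1.0)` for increasing `n` should show the accuracy approaching 1, in line with the large-deviation argument.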

However, even for symmetric models, it is clear that simply allowing n
to grow is not sufficient to allow for accurate inference of the ancestral state
$s_0$; for example, we could have just two long edges incident with the root,
and lots of very short edges that join the other endpoints of these edges to
numerous taxa. In this case, the tree essentially behaves as a two-taxon tree
and we have little information on the root when the two branches become too
long. Thus we seek relevant and reasonable constraints on the distribution
of $l_e$ values for this accurate estimation to be possible. One possibility for
generating taxon-dense trees is to evolve a Yule (pure birth) tree of total
height t and to select a large speciation rate $\lambda$ (we may then re-scale the rate
on each edge by some bounded multiplicative factor to allow for violation of
a strict molecular clock).

Moving away from symmetric models, selecting the majority state at the 
leaves as an estimate of the ancestral state is not generally a sound strategy, 
even for a star tree, since the process after a long period of time will favour 
the state with the highest equilibrium frequency, regardless of the state at 
the root. 



2. Case I: Root state estimation without detailed knowledge of $l_e$
under a symmetric Poisson model

Under the symmetric r-state Poisson model, the maximum likelihood estimate
of the root state, in the case where the branch lengths ($l_e$) are unknown
and are regarded as nuisance parameters to be optimized, is the maximum
parsimony (MP) estimate (Theorem 6 of [21]). In this setting, we can reliably
estimate the root state, provided the taxon sampling is sufficiently dense that
no edges are too long. This was suggested by earlier simulations, and we
establish two formal results now for the case when r = 2.

Proposition 2.1. Consider any rooted binary phylogenetic tree T. Evolve
a single site under the two-state symmetric model. Let $l_+$ be the maximum
branch length over all edges. Provided that $l_+ < \frac{1}{2}\log(\frac{4}{3})$, the probability $P^*$
that the maximum parsimony (MP) reconstruction of the root state is the
true state (toss a fair coin if the two states are equally favored) satisfies:

$$P^* \ge 1 - 3 l_+.$$

Proof. When $l_+$ satisfies the bound described then, for each edge
e of T, the probability that the endpoints of edge e are in different states,
$p(e) = \frac{1}{2}(1 - e^{-2 l_e})$, satisfies the inequality $p(e) < \frac{1}{8}$. It then follows from part
(ii) of Lemma 5.1 of [20] that:

$$P^* \ge \frac{1}{2} + \theta,$$

where:

$$\theta = \frac{\sqrt{(1-4g)(1-8g)}}{2(1-2g)^2},$$

and where $g = \max_e \{p(e)\}$. The result now follows from the inequalities:

$$\frac{\sqrt{(1-4g)(1-8g)}}{2(1-2g)^2} \ge \frac{1}{2}(1 - 6g), \quad \text{and} \quad g \le l_+.$$

□



Unfortunately, in a Yule tree of fixed height, the expected value of $l_+$
does not converge to zero as the speciation rate $\lambda$ tends to infinity. This
may seem surprising, since each edge in the tree converges in length to 0
as $\lambda$ grows; however, the expected number of edges increases with $\lambda$, and
the probability that at least one of them is 'long' turns out to be positive.
Simulations suggest that the expected value of $l_+$ converges to a value close
to 60% of the height of the tree; the following result, the proof of which is
provided in the Appendix, establishes a smaller lower bound.

Proposition 2.2. Suppose a random rooted binary tree $T_\lambda$ is generated by
a Yule (pure birth) process with speciation rate $\lambda$ acting for time t. Let
$l_+ = l_+(\lambda)$ denote the length of the longest edge in $T_\lambda$. Then $E[l_+(\lambda)]$ does
not converge to 0 as $\lambda \to \infty$.
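The behaviour of the longest edge can be explored by simulation. The following is a rough sketch, assuming the Yule tree is grown from a single root lineage (so the stem edge below the root is counted as an edge, a modelling choice made here) and that each lineage waits an exponential time with rate $\lambda$ before splitting; the function names and the chosen rates are hypothetical.

```python
import random

def yule_edge_lengths(rate, height, rng):
    """Edge lengths of a Yule tree grown from one root lineage for time `height`."""
    edges, pending = [], [0.0]            # pending holds the start time of each unfinished lineage
    while pending:
        start = pending.pop()
        wait = rng.expovariate(rate)      # time until this lineage speciates
        if start + wait >= height:
            edges.append(height - start)  # pendant edge, cut off at the present
        else:
            edges.append(wait)            # internal edge ending in a speciation
            pending += [start + wait, start + wait]
    return edges

def mean_longest_edge(rate, height=1.0, reps=200, seed=1):
    """Monte Carlo estimate of E[l_+] for the given speciation rate and tree height."""
    rng = random.Random(seed)
    return sum(max(yule_edge_lengths(rate, height, rng)) for _ in range(reps)) / reps

if __name__ == "__main__":
    for rate in (2.0, 4.0, 6.0):
        print(rate, round(mean_longest_edge(rate), 3))
```

Because the expected number of leaves grows like $e^{\lambda t}$, the rate should be kept moderate (here $\lambda t \le 6$) for the simulation to remain fast.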

Thus we cannot directly apply Proposition 2.1 to Yule trees. Nevertheless,
we can precisely determine the probability with which MP will correctly
reconstruct the root state of a Yule tree under a symmetric Poisson substitution
model on two states. In particular, provided the speciation rate passes
a critical threshold (six times the substitution rate), then even for large trees
where many leaves are far from the root, ancestral reconstruction is feasible.
Moreover, as the ratio of speciation rate to substitution rate tends to infinity,
we can correctly infer the root state with probability tending to 1.

2.1. MP root estimation for Yule trees under the two-state model

Consider a pure-birth Yule tree that starts with a single (root) lineage at
time 0 and is grown until time t, with speciation rate $\lambda$. Suppose we also
have a binary character that evolves from some ancestral state at the root
of the tree towards the leaves by undergoing substitution along the edges of
the tree at rate $\mu$ according to a symmetric Markov process. Thus, we have
a random tree (with a random number of leaves at time t) and a random
binary character observed at the leaves. Let $P_t$ denote the probability that
the maximum parsimony estimate for the state at the root of the tree, derived
from the observed states at the leaves of the tree at time t, matches the true
root state (in the case that both states are equally parsimonious at the root,
select one state with equal probability).

Theorem 2.3. (i) If $\lambda > 6\mu$, then for all $t > 0$:

$$P_t \ge \frac{1}{2}\left(1 + \sqrt{(1-6\rho)(1-2\rho)}\right) \ge 1 - 3\rho,$$

where $\rho = \mu/\lambda$. In particular, $P_t \to 1$ as $\rho \to 0$. Moreover, $P_t$ is
monotone decreasing in t with limit:

$$\lim_{t \to \infty} P_t = \frac{1}{2}\left(1 + \sqrt{(1-6\rho)(1-2\rho)}\right).$$

(ii) If $\lambda < 6\mu$ we have:

$$\lim_{t \to \infty} P_t = \frac{1}{2}.$$

Figure 1: The limiting value $\lim_{t \to \infty} P_t$ as a function of $\rho$ for $\rho < 1/6$.

Proof. Let 0 and 1 denote the two states that undergo substitution on the
Yule tree. Since the Markov process is symmetric we may suppose, without
loss of generality, that 0 is the initial character state at time t = 0. From
the (random) evolved states on the leaves, estimate the root state using
the maximum parsimony criterion (i.e. select the root state that minimizes
the total number of substitutions required to describe the evolution of the
character on the tree). There may be a unique reconstructed root state
(which may be the same or opposite to the true initial state) or both states
may be equally parsimonious. Let $S_t$ (resp. $D_t$) be the probability that
0 (resp. 1) is the unique most parsimonious root state reconstructed from the
observed states at the leaves. Let $E_t = 1 - S_t - D_t$ be the probability that
both states are equally parsimonious. We have:

$$P_t = S_t + \frac{1}{2}E_t = \frac{1}{2} + \frac{1}{2}(S_t - D_t). \qquad (1)$$

We can generate a system of non-linear first-order differential equations for
$(S_t, D_t, E_t)$ as follows. Consider that in the first $\delta$ period of time, the root
lineage can either:

• persist, without a substitution occurring,

• persist, with a substitution occurring, or

• speciate into two lineages.

This gives:

$$S_{t+\delta} = (1 - \mu\delta - \lambda\delta) S_t + \mu\delta D_t + \lambda\delta (S_t^2 + 2 S_t E_t) + O(\delta^2).$$

Similarly,

$$D_{t+\delta} = (1 - \mu\delta - \lambda\delta) D_t + \mu\delta S_t + \lambda\delta (D_t^2 + 2 D_t E_t) + O(\delta^2),$$

$$E_{t+\delta} = (1 - \mu\delta - \lambda\delta) E_t + \mu\delta E_t + \lambda\delta (E_t^2 + 2 S_t D_t) + O(\delta^2).$$

Rearranging these expressions and letting $\delta \to 0$ produces the differential
equation system:

$$\frac{dS_t}{dt} + (\lambda + \mu) S_t = \mu D_t + \lambda (S_t^2 + 2 S_t E_t);$$

$$\frac{dD_t}{dt} + (\lambda + \mu) D_t = \mu S_t + \lambda (D_t^2 + 2 D_t E_t);$$

and

$$\frac{dE_t}{dt} + (\lambda + \mu) E_t = \mu E_t + \lambda (E_t^2 + 2 S_t D_t).$$

Notice that we can use the relationship $S_t + D_t + E_t = 1$ to eliminate $E_t$,
and by writing $u = \lambda t$ we obtain a two-dimensional autonomous differential
equation system for $S = S_u$, $D = D_u$:

$$\frac{dS}{du} = f(S, D); \qquad \frac{dD}{du} = f(D, S),$$

where:

$$f(x, y) = (1 - \rho)x + \rho y - 2xy - x^2.$$

Now, (S, D) is confined to the simply-connected, two-dimensional, compact
region $S, D \ge 0$, $S + D \le 1$, and we can analyse its dynamics using standard
phase-portrait methods for autonomous two-dimensional dynamical systems
(see e.g. [18]). We note first that (S, D) has no limit cycle, by virtue of
Dulac's criterion (with g(x, y) = 1/xy; for details see [18]). From the starting
condition (S, D) = (1, 0) at u = t = 0, the quantity $\Delta_u = S_u - D_u$ is
non-negative and monotone decreasing, and (S, D) converges to an asymptotically
stable equilibrium point, which can be found by solving the system
$\frac{dS}{du} = \frac{dD}{du} = 0$ and carrying out an eigenvalue analysis of the Jacobian of the system.

Solving $\frac{dS}{du} = \frac{dD}{du} = 0$ is equivalent to solving the pair of simultaneous
quadratic equations f(s, d) = 0, f(d, s) = 0. Subtracting the second of these
equations from the first gives:

$$(s - d)(1 - 2\rho - s - d) = 0. \qquad (2)$$

Thus, either s = d or $s + d = 1 - 2\rho$. If s = d, then the equation f(s, d) = 0
becomes $s - 3s^2 = 0$, which has two possible solutions: either $s = d = \frac{1}{3}$ or
s = d = 0; the first of these is asymptotically stable when $\rho > 1/6$.

In the other case, where $s + d = 1 - 2\rho$, f(s, d) = 0 becomes:

$$s^2 - (1 - 2\rho)s + \rho(1 - 2\rho) = 0,$$

which also has two possible solutions:

$$s = \frac{1 - 2\rho \pm \sqrt{(1-6\rho)(1-2\rho)}}{2},$$

both of which are asymptotically stable with $\rho < 1/6$. Since $s - d \ge 0$ (since
$\Delta_u \ge 0$), the positive sign applies in the previous equation. Since in this case
$e = 1 - s - d = 2\rho$, we have:

$$s + \frac{1}{2}e = \frac{1}{2}\left(1 + \sqrt{(1-6\rho)(1-2\rho)}\right).$$

The results stated in the theorem now follow, since Eqn. (1) allows us to
write:

$$P_t = \frac{1}{2} + \frac{1}{2}\Delta_u \qquad (3)$$

and so $P_t$ is monotone decreasing with t, and we also have the inequality
$\frac{1}{2}\left(1 + \sqrt{(1-6\rho)(1-2\rho)}\right) \ge 1 - 3\rho$ when $\rho < 1/6$. □
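As a numerical sanity check on this argument, the autonomous system can be integrated directly. The sketch below assumes the system $dS/du = f(S, D)$, $dD/du = f(D, S)$ with $f(x, y) = (1-\rho)x + \rho y - 2xy - x^2$, started at (S, D) = (1, 0), and uses a classical fourth-order Runge-Kutta scheme with a hypothetical step size; it is an illustration only and not part of the proof.

```python
import math

def f(x, y, rho):
    return (1.0 - rho) * x + rho * y - 2.0 * x * y - x * x

def integrate(rho, u_max=60.0, step=0.01):
    """RK4 for dS/du = f(S, D), dD/du = f(D, S) from (S, D) = (1, 0); returns the values of P_u."""
    def rhs(s, d):
        return f(s, d, rho), f(d, s, rho)
    s, d, out = 1.0, 0.0, [1.0]
    for _ in range(int(round(u_max / step))):
        k1 = rhs(s, d)
        k2 = rhs(s + 0.5 * step * k1[0], d + 0.5 * step * k1[1])
        k3 = rhs(s + 0.5 * step * k2[0], d + 0.5 * step * k2[1])
        k4 = rhs(s + step * k3[0], d + step * k3[1])
        s += step * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        d += step * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
        out.append(0.5 + 0.5 * (s - d))   # P = 1/2 + (S - D)/2
    return out

def limit(rho):
    """Limiting accuracy claimed by the theorem (1/2 once rho exceeds 1/6)."""
    if rho < 1.0 / 6.0:
        return 0.5 * (1.0 + math.sqrt((1.0 - 6.0 * rho) * (1.0 - 2.0 * rho)))
    return 0.5
```

For example, comparing `integrate(0.1)[-1]` with `limit(0.1)` and with the lower bound `1 - 3 * 0.1` gives a direct check of part (i); values of `rho` close to 1/6 converge slowly and need a larger `u_max`.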



It would be interesting to obtain corresponding results for maximum parsimony
for more general models, particularly for symmetric models on more
than two states (some limited results are described in [17], Sections 9.4.1 and
9.5.1). Here we offer the following:

Conjecture 2.4. For the r-state symmetric model, Proposition 2.1 generalizes
to give a lower bound on $P^*$ of $1 - c_r \cdot l_+$ for some constant $c_r > 0$.
Similarly, Theorem 2.3 generalizes to give an analogous result, where the
critical ratio $\lambda/\mu = 6$ is replaced by $\lambda/\mu = c'_r$ for some constant $c'_r > 0$.

3. Case II: Conservative GTR processes



For any Markov process, we often write $P_i(X_t = j)$ or $p_{ij}(t)$ for the
conditional probability $P(X_t = j \mid X_0 = i)$ that $X_t = j$ given that $X_0 = i$. We
will say that a GTR model is conservative if, for every state i, we have:

$$p_{ii}(t) > p_{ij}(t) \text{ for all } t \ge 0 \text{ and all } j \ne i.$$

This is the 'forward inequality' described by Sober [15]. The Kimura two-parameter
(K2P) model (and every submodel, such as Jukes-Cantor) is an
example of a conservative process (see Fig. 2). In this model the substitution
probabilities are given as follow (for details see 0): 

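$$p_{ij}(t) = \begin{cases} \frac{1}{4}\left(1 + e^{-\mu t} + 2e^{-\mu t(\kappa+1)/2}\right), & \text{if } i = j; \\ \frac{1}{4}\left(1 + e^{-\mu t} - 2e^{-\mu t(\kappa+1)/2}\right), & \text{if } i \to j \text{ is a transition}; \\ \frac{1}{4}\left(1 - e^{-\mu t}\right), & \text{if } i \to j \text{ is a transversion}. \end{cases}$$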
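These probabilities are easy to tabulate. The following sketch assumes the standard K2P closed forms $\frac{1}{4}(1 + e^{-\mu t} \pm 2e^{-\mu t(\kappa+1)/2})$ for identity and transition and $\frac{1}{4}(1 - e^{-\mu t})$ for each transversion, with the illustrative values $\kappa = 4$ and $\mu = 2/3$; it checks on a grid that the model is conservative but that the transition probability is not monotone in t.

```python
import math

def k2p(t, mu=2.0 / 3.0, kappa=4.0):
    """Return (p_same, p_transition, p_transversion) at time t under K2P."""
    a = math.exp(-mu * t)
    b = math.exp(-mu * t * (kappa + 1.0) / 2.0)
    return 0.25 * (1 + a + 2 * b), 0.25 * (1 + a - 2 * b), 0.25 * (1 - a)

grid = [0.01 * k for k in range(1, 1001)]                       # t in (0, 10]
conservative = all(k2p(t)[0] > max(k2p(t)[1:]) for t in grid)   # p_ii exceeds every p_ij
transitions = [k2p(t)[1] for t in grid]
monotone = all(x <= y for x, y in zip(transitions, transitions[1:]))
t_peak = grid[transitions.index(max(transitions))]              # location of the local maximum
```

Setting the derivative of the transition probability to zero gives a peak at $t = 2\ln(\kappa+1)/(\mu(\kappa-1))$, which equals $\ln 5$ for these parameter values, so `t_peak` should lie close to 1.6.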

With a conservative model, the majority rule applies for ancestral reconstruction.
Assuming state i at the tree root, the probability of observing i
at any tree leaf is higher than the probability of observing any particular
alternative state j. This holds true whatever the root-to-leaf distances and
the tree topology. With a star tree, with an upper bound on the root-to-leaf
distances, the probability that this inference rule makes the correct selection
tends to 1 as the number of leaves grows (by the central limit theorem for
sums of independent random variables). We shall see that this result still
holds for a more general class of trees under mild assumptions. We now
describe this class of trees and their properties.



Figure 2: The three substitution probabilities for the Kimura 2ST model, plotted against t. This model is
conservative (but not monotone). In this example, $\kappa = 4$ and $\mu$ is chosen to be 2/3 so
that t corresponds to the expected number of substitutions. The curve descending from
1 is the function $p_{ii}(t)$. The middle curve, which has a local maximum around 1.6, is the
probability of a transition (A ↔ G or C ↔ T); the lower curve is the probability of a
transversion (purine (A or G) ↔ pyrimidine (C or T)).

3.1. Well-spread trees

Given a rooted phylogenetic X-tree and a leaf $x \in X$, let:

$$l_x := \sum_{e \in P(\rho, x)} l_e$$

be the sum of the branch lengths on the path $P(\rho, x)$ from the root of the tree
($\rho$, which here denotes the root vertex rather than the ratio $\mu/\lambda$ of Section 2) to leaf x, where X is the set of n leaves. Similarly, for distinct leaves
$x, y \in X$, let:

$$l_{xy} := \sum_{e \in P(\rho, x) \cap P(\rho, y)} l_e$$

be the total length of the shared paths from $\rho$ to the leaves x, y. Finally, define
the spread of T as:

$$s(T) := \frac{\sum_{x \ne y} \min\{l_{xy}, 1\}}{n(n-1)}.$$

Thus, provided $l_{xy} \le 1$ for all x, y, s(T) is the average value of $l_{xy}$ over pairs
x, y. We say that T is well spread if s(T) is small; more precisely, T is $(1-\beta)$-spread
if $s(T) \le \beta$. In particular, a tree is a star tree if and only if it is
1-spread.
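The spread of a small tree can be computed directly from this definition. The sketch below assumes a hypothetical nested representation in which a leaf is a string and an internal vertex is a list of `(branch_length, child)` pairs; it takes $l_{xy}$ to be the distance from the root to the most recent common ancestor of x and y, and averages $\min\{l_{xy}, 1\}$ over ordered pairs of distinct leaves.

```python
from itertools import combinations

def leaves_and_overlaps(tree, depth=0.0):
    """Return (leaf names below `tree`, dict mapping each unordered leaf pair to l_xy)."""
    if isinstance(tree, str):
        return [tree], {}
    leaves, overlaps, groups = [], {}, []
    for length, child in tree:
        sub_leaves, sub_overlaps = leaves_and_overlaps(child, depth + length)
        groups.append(sub_leaves)
        overlaps.update(sub_overlaps)
        leaves += sub_leaves
    for g1, g2 in combinations(groups, 2):   # pairs separated at this vertex share the path down to it
        for x in g1:
            for y in g2:
                overlaps[frozenset((x, y))] = depth
    return leaves, overlaps

def spread(tree):
    """s(T): average of min(l_xy, 1) over the n(n-1) ordered pairs of distinct leaves."""
    leaves, overlaps = leaves_and_overlaps(tree)
    n = len(leaves)
    # each unordered pair appears twice among the ordered pairs
    return 2.0 * sum(min(l, 1.0) for l in overlaps.values()) / (n * (n - 1))
```

For instance, a star tree such as `[(1.0, "a"), (2.0, "b"), (0.5, "c")]` has spread 0, while `[(0.3, [(0.1, "a"), (0.2, "b")]), (0.5, "c")]` has $l_{ab} = 0.3$ and all other overlaps zero, giving $s(T) = 0.1$.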

Note that a well-spread tree must have a large number of edges close to
its root; an example is shown in Fig. 3.

Figure 3: (a) A well-spread tree; (b, c) Trees that are not well-spread.

It is easily shown that a sufficient condition for a tree to be $(1-\beta)$-spread
is that, for some $\epsilon, \delta > 0$ with $\epsilon + \delta < \beta$, the proportion of pairs of leaves
whose paths from the root to those leaves overlap by a length of at least $\epsilon$ is
no more than $\delta$ (such pairs contribute at most $\delta$ to s(T), and the remaining
pairs contribute less than $\epsilon$). We use this observation to show that a Yule
pure-birth tree of fixed height t becomes arbitrarily well spread as the speciation rate grows.

Proposition 3.1. Consider a random Yule pure-birth tree T that has speciation
rate $\lambda$ and fixed height t. Then for any $\beta > 0$, the probability that T is
$(1-\beta)$-spread converges to 1 as $\lambda \to \infty$.

Proof. We may assume that T has a root of out-degree 2 (the length of the
initial single lineage shrinks to zero with probability 1 as $\lambda$ grows). By Theorem
2(2) of [10], the expected proportion of pairs of leaves whose most recent common
ancestor lies r or more edges from the root of the tree has the geometric
probability $(2/3)^r$. Given $\epsilon, \delta > 0$ with $\epsilon + \delta < \beta$, first select a sufficiently large
value of r that $(2/3)^r < \delta$. For any $\eta > 0$ we can now select a sufficiently large
value of $\lambda$ that the probability that all the (at most) $2^r$ vertices separated
from the root by r edges are within distance $\epsilon$ from the root is at least
$1 - \eta$. The result now follows. □



We now introduce some further notation. For each state $j \in \mathcal{S}$, let $n_j$
denote the number of leaves of T that are in state j, and let $p_j = p_j^i$ be the
expected proportion of leaves that are in state j, given that the root is in
state i. Thus $n_j$ is a random variable (whose distribution depends on the
root state i) while $p_j$ is a value determined by the model parameters, j and
root state i.

The following Lemma is central to many of the results that follow in this
paper.

Lemma 3.2. Suppose that T is a rooted tree, with branch lengths, which
is $(1-\beta)$-spread. Then for any continuous-time Markov process on T, the
following holds for all initial states i: For any s > 0, the probability of the
event that for all states $j \in \mathcal{S}$:

$$\left| \frac{n_j}{n} - p_j \right| < s$$

is at least $1 - f(n, \beta)/s^2$, where $f(n, \beta)$ tends to 0 as $\max\{\frac{1}{n}, \beta\} \to 0$.



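Proof. For $x \in X = \{1, \ldots, n\}$, let $\theta_x^j$ be the random variable that takes
the value 1 if leaf x is in state j and 0 otherwise. We have $\frac{n_j}{n} = \frac{1}{n}\sum_{x=1}^{n} \theta_x^j$
and $p_j = \frac{1}{n}\sum_{x=1}^{n} p_{ij}(l_x)$. In particular, since $p_{ij}(l_x) = E[\theta_x^j]$, linearity of
expectation gives:

$$E\left[\frac{n_j}{n}\right] = p_j.$$

Now:

$$\mathrm{Var}\left[\frac{n_j}{n}\right] = n^{-2}\left( \sum_{x} \mathrm{Var}[\theta_x^j] + \sum_{x \ne y} \mathrm{Cov}[\theta_x^j, \theta_y^j] \right), \qquad (4)$$

and $\mathrm{Var}[\theta_x^j] \le \frac{1}{4}$, $|\mathrm{Cov}[\theta_x^j, \theta_y^j]| \le 1$. Moreover, for any pair (x, y) we claim
that $|\mathrm{Cov}[\theta_x^j, \theta_y^j]| \le K \min\{1, l_{xy}\}$ for a constant K dependent only on the
model. To see this, let $N_{xy}$ be the event that the root ancestral state does
not change state anywhere along the shared path of length $l_{xy}$. We have
$P(N_{xy}) = \exp(-c\, l_{xy})$ for a constant c dependent only on the model. Moreover,
the random variables $\theta_x^j, \theta_y^j$ are conditionally independent, given $N_{xy}$.
Routine algebra then shows that we can express $\mathrm{Cov}[\theta_x^j, \theta_y^j]$ as $l_{xy}$ times a
constant, plus terms of order $l_{xy}^2$. However, since in addition $|\mathrm{Cov}[\theta_x^j, \theta_y^j]| \le 1$,
we have $|\mathrm{Cov}[\theta_x^j, \theta_y^j]| \le \min\{K l_{xy}, 1\} \le K \min\{l_{xy}, 1\}$ for some sufficiently
large constant $K \ge 1$. Thus, from (4) we have:

$$\mathrm{Var}\left[\frac{n_j}{n}\right] \le \frac{1}{4n} + K s(T).$$

Let $f_1(n, \beta) = \frac{1}{4n} + K\beta$; then, since $s(T) \le \beta$, and by Chebyshev's inequality,
we have:

$$P\left(\left|\frac{n_j}{n} - p_j\right| \ge s\right) \le \frac{\mathrm{Var}[n_j/n]}{s^2} \le f_1(n, \beta)/s^2.$$

Thus, if $r = |\mathcal{S}|$ denotes the number of possible states, and if we let $f(n, \beta) :=
r f_1(n, \beta)$ then we have:

$$P\left(\exists j : \left|\frac{n_j}{n} - p_j\right| \ge s\right) \le f(n, \beta)/s^2,$$

which converges to zero when both $n \to \infty$ and $\beta \to 0$. □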

Theorem 3.3. For a conservative model and a $(1-\beta)$-spread tree, with $l_x \le l$
for each x, the probability that the majority state at the leaves is identical to
the ancestral state at the root is at least $1 - g(l, n, \beta)$ for a function g which
(for each value of l) tends to zero as $\max\{\frac{1}{n}, \beta\} \to 0$.

Proof. Let

$$\delta_l := \min_{i \ne j} \inf\{p_{ii}(t) - p_{ij}(t) : t \in [0, l]\}.$$

Since p is continuous, and [0, l] is compact, the conservative property implies
that $\delta_l > 0$. Moreover, we have:

$$p_i^i - p_j^i \ge \delta_l \text{ for all } j \ne i. \qquad (5)$$

Now by Lemma 3.2 the probability of the event that $\left|\frac{n_j}{n} - p_j^i\right| < \frac{1}{2}\delta_l$ for
all j is at least $1 - 4f(n, \beta)/\delta_l^2$. Moreover, for this event, Inequality (5)
implies (by the triangle inequality) that $\frac{n_i}{n} - \frac{n_j}{n} > 0$ for all $j \ne i$; that is,
the ancestral state i is the majority state at the leaves. Thus the probability
that the ancestral state is the majority state is at least $1 - g(l, n, \beta)$, where
$g(l, n, \beta) := 4f(n, \beta)/\delta_l^2$ has the required stated properties. □

4. Case III: Monotone time-reversible processes

Note that, for any general time-reversible (GTR) Markov process, the
function $p_{ii}(t)$ is always monotone decreasing to its equilibrium frequency $\pi_i$
for each state i [2], that is:

$$p_{ii}(t) \ge p_{ii}(t') \text{ for all } t \le t'.$$

We will say the model is monotone if, for all distinct states i, j, we have:

$$p_{ij}(t) \le p_{ij}(t') \text{ for all } t \le t'.$$

Thus a monotone model has the property that if we start in a particular state
i then the probability that we are in a different particular state j at time
t increases monotonically with t towards its equilibrium probability $\pi_j$. In
particular, a monotone GTR model satisfies the 'backward inequality' from
[15] that $p_{ii}(t) \ge p_{ji}(t)$ for all $j \ne i$ and $t > 0$, since $p_{ji}(t)$ is monotone
increasing to $\pi_i$ while $p_{ii}(t)$ is monotone decreasing to $\pi_i$.



For any number of states, models such as the Felsenstein 1981 model (also
called the F81, Tajima-Nei, or Equal Input model) are monotone (but not
conservative, unless all equilibrium frequencies are equal). Also, any two-state
Markov process is monotone (Fig. 4), and the implications of this for
biological inference on the basis of a single observation (n = 1) were explored
in [15] and [16].

Figure 4: A two-state model with different equilibrium frequencies (0.75, 0.25) for the
states 0 and 1, respectively. The two decreasing curves are $p_{00}(t)$ (upper) and $p_{11}(t)$
(lower). The two increasing curves are $p_{10}(t)$ (upper) and $p_{01}(t)$ (lower). This model is
monotone, but not conservative.

Amongst nucleotide substitution models, the K2P model is not monotone,
since if $i \to j$ represents a transition then:

$$p_{ij}(t) = \frac{1}{4}\left(1 + e^{-\mu t} - 2e^{-\mu t(\kappa+1)/2}\right)$$

can behave as shown by the middle curve in Fig. 2, where $\kappa$ is the transition-transversion
ratio (taken to be a default option of 4 here).

Despite K2P not being monotone, this model nevertheless satisfies Sober's
'backward inequality' as it is a symmetric model (i.e. $p_{ij}(t) = p_{ji}(t)$ for all t);
however, more complex time-reversible continuous-time Markov processes can fail
this inequality. For example, consider a process on states 0, 1, 2, ..., m with
equal and high transition rates from each value of k (less than m) to k + 1
and equal low transition rates from each k (greater than 1) to k - 1. Then for
a suitably large value of m and choice of $t = t_1$, we have $p_{11}(t_1) < p_{01}(t_1)$. In
particular, observing state 1 at a particular (known) time $t_1$ provides more
evidence that the initial state was 0 rather than 1.

With monotone models, the majority rule can be misleading: when the
time t is larger than the time corresponding to the intersection point $p_{ii}(t) =
p_{ij}(t)$, a given leaf is more likely to be in state $j \ne i$ than in state i. However,
simple prediction rules still exist, which depend on what is known or unknown.

When the equilibrium frequencies are known, we use the fact that if i is
the ancestral state then the proportion of taxa in state i, $\frac{n_i}{n}$, is expected to be
larger than $\pi_i$ (at least if the number of taxa is sufficient to avoid sampling
effects), while $\frac{n_j}{n}$ is expected to be less than $\pi_j$ for all $j \ne i$. This suggests a
modified majority rule: select as an ancestral state estimate the state i which
maximizes $\frac{n_i}{n} - \pi_i$. Note that the branch lengths and even the tree topology
do not need to be known. Moreover, this decision rule becomes the simple
majority rule when the equilibrium frequencies are all equal.
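The contrast between the simple and the modified majority rule can be illustrated on a star tree. The sketch below assumes the equal-input (F81) model in the form $p_{ij}(t) = \pi_j(1 - e^{-t}) + \delta_{ij}e^{-t}$, a common root-to-leaf distance t, and hypothetical frequencies $\pi = (0.1, 0.2, 0.3, 0.4)$; the function names are illustrative.

```python
import math
import random

def f81_leaf(root, t, pi, rng):
    """Sample a leaf state under the equal-input (F81) model after time t."""
    if rng.random() < math.exp(-t):           # no resampling event on the branch
        return root
    return rng.choices(range(len(pi)), weights=pi)[0]

def simple_majority(counts):
    """State with the largest count at the leaves."""
    return max(range(len(counts)), key=counts.__getitem__)

def modified_majority(counts, pi):
    """State j maximising n_j / n - pi_j."""
    n = sum(counts)
    return max(range(len(counts)), key=lambda j: counts[j] / n - pi[j])

def simulate_counts(root, pi, n, t, seed=1):
    """Leaf state counts on a star tree with n leaves, all at distance t from the root."""
    rng = random.Random(seed)
    counts = [0] * len(pi)
    for _ in range(n):
        counts[f81_leaf(root, t, pi, rng)] += 1
    return counts
```

With root state 0, t = 2 and 2000 leaves, the expected proportion of leaves in state 0 is about 0.22 (well above $\pi_0 = 0.1$), while state 3 is expected at about 0.35 (below $\pi_3 = 0.4$); the simple majority rule therefore tends to pick state 3, whereas the modified rule recovers state 0.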

However, with site-specific models, we cannot assume that the equilibrium
frequencies are known, especially with proteins (as discussed earlier). In such
a case, we can use a second decision rule, based on the fact that $p_{ii}(t)$ is
a decreasing function of t while $p_{ij}(t)$ is increasing. This second decision
rule needs the root-to-leaf distances to be known and variable across taxa
(however the tree topology may be unknown). Let $\bar{l}_j$ be the average distance
between the root and the taxa having state j, and let $\bar{l}_{-j}$ be the average
distance between the root and the taxa having a state different to j. As the
distance of a leaf from the root increases, the probability that the leaf is in the
ancestral state i should decrease, while the reverse trend should hold for
any other state j. In other words, we select i to minimize $\bar{l}_i - \bar{l}_{-i}$. Note that
for this rule to apply, we need the root-to-taxon distances to be sufficiently
heterogeneous; with a molecular-clock tree this rule is of no help. Moreover,
we do not need to know the site rate or the absolute branch lengths, and the
topology and individual branch lengths may be unknown, provided we can still estimate
the root-to-leaf distances.

We shall see that under mild assumptions, both rules for monotone models
are statistically consistent. We now describe the two procedures for monotone
models more precisely, depending on whether $\pi$ is known or not. We then
state a theorem that provides conditions under which these estimators are
accurate.

The two procedures are as follows:

• $\pi$ known: Select the ancestral state i to maximize $\frac{n_i}{n} - \pi_i$.

• $\pi$ not known: Select the ancestral state i to minimize $\bar{l}_i - \bar{l}_{-i}$.
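The second procedure, which does not use $\pi$, can be sketched as follows. The example assumes a star tree under the equal-input (F81) model with root-to-leaf distances drawn uniformly from [0.05, 3] (so that they are far from clocklike); the simulation settings and function names are hypothetical, and states that are absent from the leaves, or that occupy all of them, are skipped because the difference of averages is then undefined.

```python
import math
import random

def distance_rule(states, lengths, n_states):
    """Pick the state j minimising (mean distance of leaves in state j) - (mean distance of the rest)."""
    best, best_d = None, float("inf")
    total, n = sum(lengths), len(lengths)
    for j in range(n_states):
        in_j = [l for s, l in zip(states, lengths) if s == j]
        if not in_j or len(in_j) == n:
            continue                       # difference undefined for this state
        d_j = sum(in_j) / len(in_j) - (total - sum(in_j)) / (n - len(in_j))
        if d_j < best_d:
            best, best_d = j, d_j
    return best

def simulate(root, pi, n, seed=1):
    """Leaf states and root-to-leaf distances on a star tree under the equal-input model."""
    rng = random.Random(seed)
    lengths = [rng.uniform(0.05, 3.0) for _ in range(n)]
    states = [root if rng.random() < math.exp(-l)
              else rng.choices(range(len(pi)), weights=pi)[0]
              for l in lengths]
    return states, lengths
```

The estimator only sees the leaf states and their distances from the root; the frequencies `pi` are used solely to generate the data.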

We will show that the first estimator performs well provided the tree is
well spread and n is large. The second estimator requires, in addition, that
there be reasonable spread amongst the root-to-leaf distances (i.e. that they
be not clocklike). First, we require a lemma which is a mild extension of
Chebyshev's order inequality (the proof is given in the Appendix).

Lemma 4.1. Suppose that Y is a random variable taking values in [0, l] and
that $f : [0, l] \to \mathbb{R}$ is a smooth function with $f'(y) \ge c > 0$ for all $y \in [0, l]$.
Then:

$$\mathrm{Cov}[Y, f(Y)] \ge c \cdot \mathrm{Var}[Y].$$

Similarly, if $f'(y) \le -c < 0$ for all $y \in [0, l]$ then $\mathrm{Cov}[Y, f(Y)] \le -c \cdot \mathrm{Var}[Y]$.
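The inequality can be checked numerically on an empirical distribution. The sketch below assumes Y is uniform on a random sample of points in $[0, l]$ with the hypothetical choices $l = 2$ and $f(y) = -e^{-y}$, for which $f'(y) = e^{-y} \ge e^{-l} = c$; it is an illustration, not a proof.

```python
import math
import random

def cov_and_var(ys, f):
    """Population Cov[Y, f(Y)] and Var[Y] for Y uniform on the sample ys."""
    n = len(ys)
    mean_y = sum(ys) / n
    fs = [f(y) for y in ys]
    mean_f = sum(fs) / n
    cov = sum((y - mean_y) * (v - mean_f) for y, v in zip(ys, fs)) / n
    var = sum((y - mean_y) ** 2 for y in ys) / n
    return cov, var

l_max = 2.0
rng = random.Random(1)
ys = [rng.uniform(0.0, l_max) for _ in range(1000)]
c = math.exp(-l_max)                                        # lower bound on f'(y) over [0, l_max]
cov_inc, var = cov_and_var(ys, lambda y: -math.exp(-y))     # increasing f
cov_dec, _ = cov_and_var(ys, lambda y: math.exp(-y))        # decreasing f
```

Here `cov_inc` should be at least `c * var` and `cov_dec` at most `-c * var`, matching the two halves of the lemma.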

Theorem 4.2. Suppose we have a monotone GTR model and $\alpha > 0$.

1. The first estimation procedure described above (for known $\pi$) correctly
selects the true ancestral state with probability at least $1 - \alpha$ provided
the following three conditions hold:

(i) $l_x \le l < \infty$ for all x, and some l independent of n;

(ii) T is $(1-\beta)$-spread for a sufficiently small value of $\beta$; and

(iii) n is sufficiently large.

2. The second estimation procedure described above (for $\pi$ not known)
correctly selects the true ancestral state with probability at least $1 - \alpha$
provided that, in addition to conditions (i) - (iii), the following two
conditions hold:

(iv) The variance of the $l_x$ values is greater than or equal to some fixed value
$v > 0$ as n grows.

(v) $\pi_j \in (0, 1)$ for all $j \in \mathcal{S}$.

Proof. For part (1), let $\delta_1 = \min_{i,j : j \ne i}\{\pi_j - p_{ij}(l)\}$, $\delta_2 = \min_i\{p_{ii}(l) - \pi_i\}$ and
$\delta_3 = \min\{\delta_1, \delta_2\}$. By the monotonicity property, we have $\delta_3 > 0$. If $i$ is the
ancestral state then:

$$p^i \ge \pi_i + \delta_3, \text{ and for any state } j \ne i, \quad p^j \le \pi_j - \delta_3. \qquad (6)$$

Now by Lemma 3.2 the probability of the event that $|\frac{n_j}{n} - p^j| < \delta_3$ for all $j$
is at least $1 - f(n, \beta)/\delta_3^2$. Moreover, for this event, Inequality (6) implies (by
the triangle inequality) that $\frac{n_i}{n} - \pi_i > 0$ and for all $j \ne i$, we have $\frac{n_j}{n} - \pi_j < 0$,
in which case the correct ancestral state ($i$) will be selected by the decision
rule. Thus if we select a sufficiently small value of $\beta$ and a sufficiently large
value of $n$ that $f(n, \beta)/\delta_3^2 < \alpha$ we obtain the result in Part (1).

For part (2), we show that if $i$ is the ancestral state then, with high
probability, $\bar{l}_i - \bar{l}_{-i} < 0$ and for all $j \ne i$, $\bar{l}_j - \bar{l}_{-j} > 0$. For any state $j$
(including $i$), consider the difference:

$$D_j := \bar{l}_j - \bar{l}_{-j}.$$

Recalling the definition of $\delta_x^j$ from the proof of Lemma 3.2 (the indicator
that leaf $x$ is in state $j$) we have:

$$\bar{l}_j = \frac{1}{n_j} \sum_{x \in X} l_x \delta_x^j \quad \text{and} \quad \bar{l}_{-j} = \frac{1}{n - n_j} \sum_{x \in X} l_x (1 - \delta_x^j),$$

and so:

$$D_j = \frac{\frac{1}{n} \sum_{x \in X} l_x \delta_x^j - \frac{n_j}{n} \cdot \frac{1}{n} \sum_{x \in X} l_x}{\frac{n_j}{n} \left(1 - \frac{n_j}{n}\right)}. \qquad (7)$$

By assumption (v), $D_j$ is well defined (i.e. $n > n_j > 0$ in the denominator)
with probability converging to 1 as $n$ grows. Let

$$\bar{l} := \frac{1}{n} \sum_{x \in X} l_x, \quad \text{and let} \quad L := \frac{1}{n} \sum_{x \in X} l_x p_{ij}(l_x).$$

Notice that we can write the numerator of $D_j$ in the form:

$$(L - \bar{l} p^j) + \left(\frac{1}{n} \sum_{x \in X} l_x \delta_x^j - L\right) + \bar{l} \left(p^j - \frac{n_j}{n}\right). \qquad (8)$$

Now, let $c_1 = \min_{j \ne i} \inf\{\frac{dp_{ij}(t)}{dt} : t \in [0, l]\}$ and $c_2 = \inf\{-\frac{dp_{ii}(t)}{dt} : t \in [0, l]\}$,
and $c = \min\{c_1, c_2\}$. By the monotone assumption, $c > 0$. We can now apply
Lemma 4.1 as follows. Define a random variable $Y$ by setting $Y = l_x$ for a
leaf $x$ selected uniformly at random from the leaf set $X$, and let $f(y) = p_{ij}(y)$.
Then, $\mathrm{Cov}[Y, f(Y)] = L - \bar{l} p^j$ and so, by Lemma 4.1 we have:

$$L - \bar{l} p^i \le -cv, \text{ and } L - \bar{l} p^j \ge cv \text{ for all } j \ne i, \qquad (9)$$

where $v > 0$ is a lower bound on the variance of the $l_x$ values from condition
(iv). Note that $\mathbb{E}[\frac{1}{n} \sum_{x \in X} l_x \delta_x^j] = L$, and since $l_x \le l$ for all $x \in X$, an
argument similar to that given in Lemma 3.2 implies that $|\frac{1}{n} \sum_{x \in X} l_x \delta_x^j - L|$
can be made less than any $\delta > 0$ by selecting $\beta$ and $\frac{1}{n}$ sufficiently small.
Moreover, the same applies for the difference $|\frac{n_j}{n} - p^j|$ by Lemma 3.2. Thus,
from expression (8), the numerator of $D_j$ can be made arbitrarily close to
the difference $L - \bar{l} p^j$ by selecting $\beta$ and $\frac{1}{n}$ sufficiently small. It then follows
from Inequality (9) that the sign of $D_j$ will be negative for $j = i$ and positive
otherwise, as required (noting that $c$ depends just on the model, not on $\beta$ or
$n$). This completes the proof. □



5. Case IV: Non-monotone and non-conservative models 

Some simple and widely used models are neither monotone nor conservative.
For example, the 'HKY' (Hasegawa, Kishino and Yano) model combines
both K2P and F81 [19]. As with K2P, the probabilities of transition-type
substitutions first increase and then decrease (non-monotonicity); and because
the equilibrium frequencies may be unequal, the probability of observing the
root state $i$ at a leaf may be less than the probability of observing state $j$
when $\pi_i < \pi_j$.




Figure 5: HKY transition probabilities with standard parameter values ($\kappa = 4$,
purine = pyrimidine = 0.5, GC = 70%). The asymptotic values 0.35 and 0.15 are the
equilibrium GC and AT frequencies, respectively. The two decreasing curves are $p_{ii}(t)$
(e.g. G → G and C → C for the top-most curve). The two increasing curves with local maxima
are for transitions (e.g. A → G, T → C for the top-most increasing curve) while the two
monotone increasing curves are for transversions.
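
The curves described in this caption can be reproduced from the standard closed form of the HKY transition probabilities. The sketch below is an illustration under assumptions: the function name `hky_prob`, the time grid, and the normalisation (one unit of time equals one expected substitution per site) are choices made here and are not stated in the caption. It locates numerically where the A → G curve peaks (loss of monotonicity) and where $p_{AA}(t)$ first drops below $p_{AG}(t)$ (loss of conservativeness).

```python
import math

PURINES = {"A", "G"}

def hky_prob(i, j, t, pi, kappa):
    """Closed-form HKY probability p_ij(t); pi must sum to 1.
    Assumed scaling: one expected substitution per unit time."""
    pi_r = pi["A"] + pi["G"]
    pi_y = pi["C"] + pi["T"]
    beta = 1.0 / (2.0 * pi_r * pi_y
                  + 2.0 * kappa * (pi["A"] * pi["G"] + pi["C"] * pi["T"]))
    e1 = math.exp(-beta * t)
    if (i in PURINES) != (j in PURINES):        # transversion
        return pi[j] * (1.0 - e1)
    grp = pi_r if j in PURINES else pi_y        # frequency of j's class
    e2 = math.exp(-beta * t * (1.0 + grp * (kappa - 1.0)))
    base = pi[j] * (grp + (1.0 - grp) * e1) / grp
    if i == j:                                  # no change
        return base + (grp - pi[j]) * e2 / grp
    return base - pi[j] * e2 / grp              # transition

pi = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}
kappa = 4.0
grid = [k / 1000.0 for k in range(1, 3001)]

# Time at which the A -> G transition probability is largest.
t_peak = max(grid, key=lambda t: hky_prob("A", "G", t, pi, kappa))

# First grid time at which p_AA(t) falls below p_AG(t).
t_cross = next(t for t in grid
               if hky_prob("A", "A", t, pi, kappa) < hky_prob("A", "G", t, pi, kappa))
print(t_peak, t_cross)
```

Under this normalisation the two thresholds can be compared with the values of roughly 1.45 and 0.8 quoted in the simulation section.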

With such models, the justifications provided for the statistical consis- 
tency of the three simple rules above and parsimony no longer apply. How- 
ever, when the model is fully known and the tree is clock-like, the ancestral 
state can still be estimated using the frequencies of the characters at the tree 
leaves. We shall see that this method is statistically consistent. 

We first state a general result concerning general Markov processes. 

Lemma 5.1. Consider any continuous-time, irreducible Markov process, and
let $X_t$ be the state at time $t$. Then for any given $t > 0$, the probability
distribution on $X_t$ determines both $X_0$ and $t$. That is:

$$\mathbb{P}_i(X_t = j) = \mathbb{P}_{i'}(X_{t'} = j) \text{ for all } j \in S \implies i = i', \; t = t'.$$

Proof. Let $e_i$ be the vector that has 1 in position $i$ and 0 otherwise, and
define $e_{i'}$ analogously. Now, the vector $p^i(t) := [\mathbb{P}_i(X_t = j) : j \in S]$ satisfies
$p^i(t) = e_i \exp(Qt)$; similarly we have $p^{i'}(t') = e_{i'} \exp(Qt')$. Suppose values of
$t, t'$ exist for which $p^i(t) = p^{i'}(t')$. Without loss of generality, we may suppose
that $t \ge t'$. In this case we have:

$$(e_i \exp(Q(t - t')) - e_{i'}) \exp(Qt') = 0.$$

Moreover, since the process is irreducible, $e_i \exp(Q(t - t'))$ can equal $e_{i'}$ only
if $t = t'$ and $i = i'$, so if this is not the case, we have $w \exp(Qt') = 0$ for a
non-zero vector $w$ which implies that

$$\det \exp(Qt') = 0.$$

But, by Jacobi's identity, $\det \exp(Qt') = \exp(\mathrm{tr}(Q) t') > 0$. This completes
the proof. □
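
The step using Jacobi's identity can be checked on the simplest case, a two-state process, for which $\exp(Qt)$ has an elementary closed form. This is only a sanity check; the rates `a`, `b` and the time `t` below are hypothetical values chosen for the sketch.

```python
import math

def two_state_matrix(a, b, t):
    """exp(Qt) for the rate matrix Q = [[-a, a], [b, -b]] (closed form)."""
    s = a + b
    e = math.exp(-s * t)
    pi0, pi1 = b / s, a / s          # stationary distribution
    return [[pi0 + pi1 * e, pi1 * (1.0 - e)],
            [pi0 * (1.0 - e), pi1 + pi0 * e]]

a, b, t = 0.7, 0.2, 1.3              # hypothetical rates and time
P = two_state_matrix(a, b, t)
det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
trace_q = -(a + b)

# Jacobi's identity: det exp(Qt) = exp(tr(Q) t), which is strictly positive.
print(abs(det - math.exp(trace_q * t)) < 1e-12)
```

Because the determinant never vanishes, the rows of $\exp(Qt)$ (the distributions started from different states) stay linearly independent for every finite $t$, which is the property the proof relies on.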

From this Lemma, it follows that for the very special case of a star tree
with all edges of equal length we can use maximum likelihood to consistently
infer the ancestral state. This is because, in this very special case, the states
at the $n$ leaves provide $n$ i.i.d. samples of the process, and so the identifiability
conditions required to estimate $s_0$ and the common edge length hold (for similar
reasons to the tailored argument for the consistency of MLE in settings such as
phylogenetic tree reconstruction, described in Lemma 5.1 of [3]). Moving from star
trees to the more general class of well-spread trees, we have the following main
result of this section:

Theorem 5.2. Suppose we have a continuous-time irreducible Markov process
with rate matrix $Q$ given, and let $\alpha > 0$. Consider a rooted phylogenetic
tree on $n$ leaves, for which the branch lengths $l_e$ satisfy a molecular clock, i.e.
$l_x = l_0$ for all $x$, where $l_0$ is less than some known value $l$. Assume also that
the tree is $1 - \beta$ spread. Then we can estimate the ancestral state $s_0$ correctly
with probability at least $1 - \alpha$ provided that $n$ is sufficiently large, and $\beta$ is
sufficiently small.

Proof. We will establish this result by a procedure that selects the state $i$ for
which the entire probability distribution $p_{ij}(t)$ (as $j$ varies) can be made the
'closest' to the empirical distribution $\frac{n_j}{n}$ for an optimal value of $t$. We will use
the $\ell_\infty$ metric to measure 'closeness' (although, in applications other metrics
may be preferable) so we will select the ancestral estimate $i$ if $i$ minimizes
the quantity:

$$\inf_{t \in [0, l]} \max_j \left| \frac{n_j}{n} - p_{ij}(t) \right|.$$

First observe that, for any two states $i, i' \in S$ with $i \ne i'$, if we let:

$$\Delta_{ii'} := \inf_{t, t' \in [0, l]} \max_{j \in S} \{|p_{ij}(t) - p_{i'j}(t')|\},$$

then $\Delta_{ii'} > 0$ by Lemma 5.1, the compactness of $[0, l]$ and the continuity of
$p$. Thus $\delta_1 := \min_{i, i' : i \ne i'} \Delta_{ii'}$ is also strictly greater than zero. Notice that
$\delta_1$ is independent of $n$. Suppose that $i$ is the true ancestral state and $i'$ is a
different state. By the molecular clock assumption, $p^j = p_{ij}(l_0)$. Thus, by
Lemma 3.2 the probability of the event that $|\frac{n_j}{n} - p_{ij}(l_0)| < \frac{1}{2}\delta_1$ for all $j$ is
at least $1 - 4f(n, \beta)/\delta_1^2$. Moreover, for this event:

$$\inf_{t \in [0, l]} \max_j \left| \frac{n_j}{n} - p_{ij}(t) \right| < \inf_{t' \in [0, l]} \max_j \left| \frac{n_j}{n} - p_{i'j}(t') \right|$$

since the left-hand side is less than $\frac{1}{2}\delta_1$ and if $t'$ is the value that minimizes
the right-hand side then, by the triangle inequality for the $\ell_\infty$ metric:

$$\max_j \left| \frac{n_j}{n} - p_{i'j}(t') \right| \ge \max_j |p_{ij}(l_0) - p_{i'j}(t')| - \max_j \left| \frac{n_j}{n} - p_{ij}(l_0) \right| > \Delta_{ii'} - \frac{1}{2}\delta_1 \ge \frac{1}{2}\delta_1.$$

Thus, the selection method will choose the correct ancestral state ($i$) with
probability at least $1 - 4f(n, \beta)/\delta_1^2$ and, as before, this can be larger than
$1 - \alpha$ by ensuring that $\frac{1}{n}$ and $\beta$ are sufficiently small. □
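
The selection procedure used in this proof can be sketched as follows. The infimum over $t$ is approximated on a regular grid, and the example uses a hypothetical two-state process with rates `a` and `b`; the function names, the grid size and the toy values are assumptions of this sketch rather than anything specified in the paper.

```python
import math

def p_two_state(i, j, t, a=0.7, b=0.2):
    """p_ij(t) for a two-state process with rates a (0 -> 1) and b (1 -> 0)."""
    s = a + b
    e = math.exp(-s * t)
    stat = (b / s, a / s)                      # stationary distribution
    return stat[j] + ((1.0 if i == j else 0.0) - stat[j]) * e

def best_fit_state(freqs, p, states, l_max, grid_size=2000):
    """Return the state i minimizing min_t max_j |freqs[j] - p(i, j, t)|,
    with t restricted to a regular grid on [0, l_max]."""
    def score(i):
        return min(
            max(abs(freqs[j] - p(i, j, k * l_max / grid_size)) for j in states)
            for k in range(grid_size + 1)
        )
    return min(states, key=score)

states = (0, 1)
l0 = 0.8                                       # hypothetical clock height
# Idealised leaf frequencies: exactly the distribution started from state 1.
freqs = {j: p_two_state(1, j, l0) for j in states}
print(best_fit_state(freqs, p_two_state, states, l_max=2.0))
```

With real data the frequencies `freqs` would be the observed proportions $n_j/n$, and the argument above says that for well-spread clock-like trees they are close enough to $p_{ij}(l_0)$ for the correct state to win.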



When the tree is not clock-like, and the model and branch lengths are
known, we might use a standard ML approach based on the pruning algorithm,
though the precise conditions required for statistical consistency seem
less clear.

6. Simulations



To compare the convergence rate and the performance of the various 
ancestral character reconstruction methods discussed in the previous sections, 
we performed computer simulations under biologically realistic conditions 
similar to Q. 

We first generated a Yule tree with n = 25, 50, 100, 200, 400, 800 and 
1600 leaves. This molecular-clock tree was then perturbed by multiplying 
every branch length (independently) by (1 + X), where X was an exponential 
variable with parameter 0.5. The factor (1 + X) was used (as opposed to, 
say, X) to avoid an excessive number of very small branches. The observed 
departure from the molecular clock, as measured by the ratio between the 
longest and shortest root-to-leaf lineages, was equal to ~ 3.5 on average, 
a value that is usual in published phylogenies. Finally, the whole tree was 
re-scaled so that the average root-to-leaf distance was uniformly distributed 
between 0.1 (relatively low divergence) and 1.0 (high divergence). 
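
The perturbation and re-scaling steps just described can be sketched on a toy tree. Several points are assumptions of this sketch: the dictionary representation of the tree, the toy branch lengths, and the reading of 'parameter 0.5' as the rate of the exponential distribution (mean 2). If a mean of 0.5 was intended, `rate=2.0` should be passed instead. Generating the Yule tree itself is not covered here.

```python
import random

def root_to_leaf_distance(tree, leaf):
    """Sum the branch lengths on the path from `leaf` up to the root."""
    dist, node = 0.0, leaf
    while tree[node][0] is not None:
        parent, length = tree[node]
        dist += length
        node = parent
    return dist

def perturb_and_rescale(tree, leaves, rng, rate=0.5, low=0.1, high=1.0):
    """Multiply every branch by (1 + X), X exponential with the given rate,
    then rescale so the mean root-to-leaf distance equals a uniform target."""
    perturbed = {
        node: (parent, length * (1.0 + rng.expovariate(rate)))
        for node, (parent, length) in tree.items()
    }
    mean_dist = sum(root_to_leaf_distance(perturbed, x) for x in leaves) / len(leaves)
    target = rng.uniform(low, high)
    scale = target / mean_dist
    rescaled = {node: (parent, length * scale)
                for node, (parent, length) in perturbed.items()}
    return rescaled, target

# Hypothetical clock-like toy tree: node -> (parent, length of the branch above it).
tree = {"root": (None, 0.0), "u": ("root", 0.4), "c": ("root", 1.0),
        "a": ("u", 0.6), "b": ("u", 0.6)}
leaves = ["a", "b", "c"]
new_tree, target = perturb_and_rescale(tree, leaves, random.Random(1))
dists = [root_to_leaf_distance(new_tree, x) for x in leaves]
print(max(dists) / min(dists))   # ratio of longest to shortest root-to-leaf lineage
```

The printed ratio is the same departure-from-clock statistic that the text reports for the simulated trees.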

DNA-like sequences of 100 sites were evolved along this tree using the
HKY model with $\kappa = 4.0$ (default value in most software) and the equilibrium
frequencies of A, C, G and T being equal to 0.15, 0.35, 0.35 and 0.15,
respectively (such GC bias is observed in thermophilic bacteria and archaea,
while Plasmodium species have an even stronger AT bias). The same parameter
values are used in Fig. 5. This HKY model was combined with a
discrete gamma distribution of parameter 1.0 with six rate categories. We
generated 500 data sets under these settings for each tree size $n$.

Five ancestral character prediction methods were compared: 

• 'Parsimony' (studied in Section 2);

• 'Majority' (studied in Section 3);

• 'Modified majority', when the equilibrium frequencies $\pi$ are known
(studied in Section 4, cf. part (1) of Theorem 4.2);

• 'Difference of average root-to-leaf distances', when the equilibrium
frequencies $\pi$ are unknown, but we know the root-to-leaf distances (studied
in Section 4, cf. part (2) of Theorem 4.2);

• 'Presence', which involves drawing with equal probability one of the
characters that are present at the tree leaves. Indeed, it frequently
occurs (notably with small $n$) that not all four possible characters are
observed at the tree leaves. Moreover, none of the previous prediction
methods ever outputs a character that is not seen at the tree leaves. This
implies that the difficulty of the prediction problem depends on the
number of extant character states, and thereby depends on $n$. In the
extreme case where we observe a unique extant character, all methods
achieve perfect predictions (barring hidden convergent substitutions),
while when all four characters are observed a random guess is correct
with probability 1/4. 'Presence' is thus used to re-scale the performance of
the various methods, depending on $n$ and the hardness of the prediction
problem.
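
The two simplest predictors in this list can be written in a few lines. The sketch below is illustrative: the function names are hypothetical, and the alphabetical tie-breaking in `majority` is an assumption, since the text does not say how ties are resolved.

```python
import random
from collections import Counter

def majority(leaf_states):
    """'Majority': the most frequent leaf state (ties broken alphabetically,
    which is an assumption of this sketch)."""
    counts = Counter(leaf_states)
    top = max(counts.values())
    return sorted(s for s, c in counts.items() if c == top)[0]

def presence(leaf_states, rng):
    """'Presence': draw one of the observed states with equal probability."""
    return rng.choice(sorted(set(leaf_states)))

def presence_expected_accuracy(leaf_states, true_root):
    """Probability that 'Presence' is correct for one site: 1/k when the
    root state is among the k observed states, and 0 otherwise."""
    observed = set(leaf_states)
    return 1.0 / len(observed) if true_root in observed else 0.0

site = list("AACG")                      # hypothetical leaf states at one site
print(majority(site), presence(site, random.Random(0)))
print(presence_expected_accuracy(site, "A"))
```

The last function makes explicit why the baseline gets harder as more distinct states are observed at the leaves.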

Each method was run with perfect knowledge of the information it requires:
the tree topology ('Parsimony'), the equilibrium frequencies $\pi$ ('Modified
majority') or the root-to-leaf distances ('Difference of average root-to-leaf
distances'). For each method and each data set, we measured:

• The percentage of correct predictions;

• The rescaled percentage of correct predictions, using the results achieved
by 'Presence'. Let $P$ be the percentage of correct predictions of the
given method, and $R$ be the percentage of correct predictions of 'Presence';
the rescaled percentage of correct predictions is equal to
$(P - R)/(1 - R)$ and measures the fraction of improvement brought by the
given method compared to random predictions.
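
As a small worked example of the rescaling formula, the sketch below applies it to accuracies expressed as fractions (as in Table 1). The function name is hypothetical. Applied to the averaged Table 1 entries for $n = 1600$ ('Parsimony' 0.890, 'Presence' 0.261) it gives about 0.851, close to the reported 0.852; exact agreement is not expected, presumably because the rescaling is applied to each data set before averaging.

```python
def rescaled_accuracy(p, r):
    """Rescaled accuracy (P - R) / (1 - R), with P and R given as fractions
    in [0, 1) rather than percentages."""
    return (p - r) / (1.0 - r)

# Averaged Table 1 values for n = 1600: 'Parsimony' and 'Presence'.
print(round(rescaled_accuracy(0.890, 0.261), 3))
```

A value of 0 means no improvement over random choice among the observed states, and a value of 1 means perfect prediction.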

Results averaged over 500 data sets are reported in Table 1 for each tree 
size n. We see that: 

• The results of 'Presence' indicate that the hardness of the prediction
problem increases when $n$ increases; with $n = 25$ the number of extant
characters is around two on average, while it is around four with $n =
1600$, meaning that the problem is 'twice as hard' with $n = 1600$ as
compared with $n = 25$.

• The accuracy of all methods improves as $n$ increases. However, the
rescaled percentage of correct predictions is required to see this effect
with 'Difference of average root-to-leaf distances', which is the method
with the slowest convergence rate.

• Surprisingly, 'Parsimony' is slightly behind 'Majority' and 'Modified
majority'. This finding is also observed with the symmetrical JC69 model
(results not shown), and thus cannot be attributed to the chosen
substitution model (HKY); it is likely due to the fact that some of the
simulated trees show a high divergence, a condition where 'Parsimony'
tends to perform poorly (see Theorem 2.3).

• 'Majority' and 'Modified majority' perform very similarly, although we
expected the latter to be better because it makes use of the equilibrium
frequencies $\pi$. The explanation is likely related to the fact that in
our simulations the root-to-leaf distance is less than 1.0 on average, a
condition where HKY is basically conservative (cf. Fig. 5) and thus
'Majority' is consistent. However, we see a small superiority of 'Modified
majority' with large $n$, when the estimations of the $n_j/n$ frequencies
become sufficiently reliable. Moreover, HKY is monotone up to ≈ 1.45
while it is conservative up to ≈ 0.8 only.

• Finally, the performance of 'Difference of average root-to-leaf distances' 
is rather low, but there is a clear improvement with large n. This con- 
firms that root-to-leaf distances bring substantial information, which 
could be combined with other standard approaches to enhance accuracy 
in difficult cases. 

Altogether, the most surprising outcome of these simulations is the
performance of the (very simple) 'Majority' approach. It must be emphasized
that 'Majority' does not use any additional knowledge (tree topology,
root-to-leaf distances or equilibrium frequencies), meaning that the gap could be
larger if the other methods (e.g. 'Parsimony') were used with only approximate
knowledge (e.g. tree topology).

Table 1: Average accuracy with simulated data. For each method we provide the
percentage of correct predictions and (within parentheses) the rescaled percentage of correct
predictions (see text for definition). 500 data sets with 100 sites each were used for each
number of taxa (n). Abbreviations are Mod. Majority: 'Modified majority', Diff. Aver.
Dist.: 'Difference of average root-to-leaf distances'.

n       Parsimony       Majority        Mod. Majority   Diff. Aver. Dist.   Presence
25      0.820 (0.652)   0.832 (0.674)   0.824 (0.659)   0.609 (0.214)       0.499
50      0.841 (0.728)   0.852 (0.746)   0.846 (0.736)   0.570 (0.237)       0.433
100     0.853 (0.772)   0.863 (0.788)   0.860 (0.784)   0.539 (0.262)       0.371
200     0.864 (0.802)   0.870 (0.811)   0.871 (0.813)   0.521 (0.285)       0.326
400     0.873 (0.822)   0.880 (0.833)   0.886 (0.842)   0.522 (0.324)       0.289
800     0.885 (0.844)   0.885 (0.844)   0.896 (0.858)   0.537 (0.362)       0.270
1600    0.890 (0.852)   0.891 (0.853)   0.906 (0.873)   0.567 (0.410)       0.261


7. Discussion 

In this paper, we have described and analysed five approaches for inferring 
ancestral root state in taxon-rich trees: maximum parsimony, simple majority 
rule, modified majority rule, root-to-leaf differences, and best-fit of expected 
distribution of leaf states to the empirical distribution. The methods are all 
relatively simple and easily implemented, and require different model (and 
tree) assumptions in order to justify their accuracy. They can be applied 
in settings where one does not have enough information to carry out a full 
maximum likelihood analysis using the usual pruning algorithm, and so may 
be more suitable for site-specific models, where the process of evolution is 
likely to vary in a partially unknown way from character to character. 

The price one might expect to pay for a method that requires fewer assumptions
or less detailed knowledge of underlying parameters is lower accuracy.
Nevertheless, we have described several results which show that these methods
(each suited to a particular type of model) can still return the correct
ancestral state with high probability provided that the number of taxa ($n$) is
sufficiently large, and the tree is sufficiently well-spread. We have shown that
for Yule trees with a high speciation rate (as a proxy for high taxon coverage),
we expect a tree of fixed height to become increasingly well-spread as $n$ grows.
It is clear that some type of assumption on the spread of the tree is necessary
to avoid having two long branches near the root and the majority of lineage
splitting well away from the root, in which case accurate root state inference
is not possible.

Except for maximum parsimony, the methods described do not use the
tree topology explicitly (only the distribution of states at the leaves, and
perhaps their distances from the root, are employed) and so may be more robust
to tree mis-specification. Of the classes of models described, monotone models
are perhaps the most relevant for applications, since most GTR models are
likely to be monotone (and even conservative) when restricted to the amounts of
evolutionary change that are commonly encountered in sequence evolution.

Our choice of methods to study in this paper has been guided by what can 
be usefully analysed, and we are not advocating these methods above others 
that might be considered; in particular, we make no claim that they are 
'best possible'. Indeed, if one has sufficient information then more standard 
approaches such as maximum likelihood would be preferable. However, the 
simplicity of these methods, and the fact that they are relatively robust 
to model mis-specification may make them a useful complement to more 
sophisticated approaches. It is also possible to develop statistical tests to 
determine whether differences observed in the data by our approaches are 
significant or not. For future studies, it would be worthwhile to explore the 
performance of these approaches on biological data sets, comparing them
with other alternative approaches that have been advocated; however.

9. Appendix: Proof of Proposition 2.2 and Lemma 4.1



For the proof of Proposition 2.2 let $n = e^{\lambda t/2}$. Then the expected number
of taxa at time $t/2$ is $n$ and is $n^2$ at time $t$. Let $N_u$ be the number of
individuals at time $u$. Let $E_1$ be the event that $N_{t/2}$ lies between $\frac{1}{2}n$ and $\frac{3}{2}n$,
let $E_2$ be the event that $N_t \le 2n^2$, and let $E$ be the conjunction of $E_1, E_2$.
We first establish the following:

CLAIM: For some $\delta > 0$, $\mathbb{P}(E) \ge \delta$, for all sufficiently large $\lambda$.

We have $\mathbb{P}(E) = \mathbb{P}(E_2 | E_1) \cdot \mathbb{P}(E_1)$. Now, the fact that $N_{t/2}/e^{\lambda t/2}$ has a
limiting distribution as $\lambda$ tends to infinity (an exponential distribution with
a mean of 1) implies that $\mathbb{P}(E_1) \ge \delta' > 0$ for a fixed $\delta' > 0$ (we can take for $\delta'$
any number smaller than $e^{-1/2} - e^{-3/2}$ for large enough values of $\lambda$). Moreover,
$\mathbb{E}[N_t | E_1] \le \frac{3}{2}n^2$, since $\mathbb{E}[N_t | N_{t/2} = k] = k e^{\lambda t/2} = kn \le \frac{3}{2}n^2$ for any $k \le 3n/2$.
However:

$$\mathbb{E}[N_t | E_1] \ge 2n^2 \cdot \mathbb{P}(N_t > 2n^2 | E_1) = 2n^2 (1 - \mathbb{P}(E_2 | E_1)).$$

Thus, $\mathbb{P}(E_2 | E_1) \ge \frac{1}{4}$, and so, $\mathbb{P}(E) \ge \frac{1}{4}\delta' =: \delta > 0$ as claimed.

Suppose the number of individuals at time $t/2$ is $m$; label them $1, 2, \ldots, m$.
For individual $i$, let $n_i$ be the number of descendants at time $t$. Thus $\sum_{i=1}^m n_i$
is the total number of individuals at time $t$. Now we use a well-known property
of the (discrete) Yule distribution: for a binary tree with $n_i \ge 2$ leaves, the
probability that the root is incident with a leaf is at least $2/n_i$ (it equals
$2/(n_i - 1)$ when $n_i \ge 3$). Now individual $i \in \{1, \ldots, m\}$ is not the root of a
binary tree, but if the binary tree below $i$ has the property just described, then
either the edge $i$ lies on, or an edge in the binary tree below it, has a length of
at least $t/4$. Also if $n_i < 2$ then once again we must have at least one edge with
a length of at least $t/4$.

For any particular value of $m$ that satisfies event $E_1$, let $p$ be the probability
that none of the $m$ individuals gives rise in this way to an edge of
length at least $t/4$. Then $p$ is bounded above (by independence) as follows:

$$p \le \prod_{i=1}^m \left(1 - \frac{2}{n_i}\right), \qquad (10)$$

where the $n_i$ values satisfy constraints implied by $E$:

$$\sum_{i=1}^m n_i \le 2n^2, \quad \text{and} \quad m \ge \frac{1}{2}n,$$

as well as our assumption $n_i \ge 2$ for all $i$. Maximizing the term on the
right-hand side of (10) subject to the constraint $\sum_{i=1}^m n_i = 2n^2$, we have:

$$p \le \left(1 - \frac{2m}{2n^2}\right)^m \le e^{-m^2/n^2} \le e^{-0.25}.$$

Thus, with probability at least $\delta(1 - e^{-0.25})$ there is an edge in the Yule tree
having length at least $t/4$. This completes the proof of Proposition 2.2.

Proof of Lemma 4.1: Suppose $f'(y) \ge c > 0$ for all $y \in [0, l]$ and that
$Y$ is discrete taking finite values $l \ge y_1 > y_2 > \cdots > y_n \ge 0$ (other cases
are similar), and let $p(y) = \mathbb{P}(Y = y)$. Then evaluating the following double
sum by expanding out terms gives us the identity:

$$\sum_{i,j} (y_i - y_j)(f(y_i) - f(y_j)) p(y_i) p(y_j) = 2\,\mathrm{Cov}[Y, f(Y)]. \qquad (11)$$

However we can also write this double sum in the form:

$$2 \sum_{i,j : i > j} (y_i - y_j)(f(y_i) - f(y_j)) p(y_i) p(y_j) \ge 2c \sum_{i,j : i > j} (y_i - y_j)^2 p(y_i) p(y_j), \qquad (12)$$

where the inequality holds since, for $y_i > y_j$ the condition $f'(y) \ge c$ for all
$y \in [0, l]$ implies that $f(y_i) - f(y_j) \ge c(y_i - y_j)$ by the mean value theorem.
Now,

$$2c \sum_{i,j : i > j} (y_i - y_j)^2 p(y_i) p(y_j) = c \sum_{i,j} (y_i - y_j)^2 p(y_i) p(y_j) = 2c\,\mathrm{Var}[Y].$$

Applying this to Eqns. (11) and (12) gives the result claimed.